{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8dd0acdb-5aec-4129-8772-81f56d6b25cf",
   "metadata": {},
   "source": [
    "# Sub-Document Summary Metadata Pack\n",
    "\n",
    "This LlamaPack provides an advanced technique for injecting each chunk with \"sub-document\" metadata. This context augmentation technique is helpful for both retrieving relevant context and for synthesizing correct answers.\n",
    "\n",
    "It is a step beyond simply adding a summary of the document as the metadata to each chunk. Within a long document, there can be multiple distinct themes, and we want each chunk to be grounded in global but relevant context."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66818da6-a3fb-4537-b30a-922a8a0ef99e",
   "metadata": {},
   "source": [
    "## Setup Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "317a3207-1211-4a6a-bd7d-3ab14f399951",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "811.82s - pydevd: Sending message related to process being replaced timed-out after 5 seconds\n",
      "817.00s - pydevd: Sending message related to process being replaced timed-out after 5 seconds\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 13.0M  100 13.0M    0     0  27.7M      0 --:--:-- --:--:-- --:--:-- 28.0M\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/'\n",
    "!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf6ab9c0-c993-4ab2-8343-b294676d7550",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\"data\").load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98bfbe4b-539c-469c-82e6-1f823f28d5f4",
   "metadata": {},
   "source": [
    "## Run the Sub-Document Summary Metadata Pack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af4b815e-f5ce-406b-9dcb-5a23fc9f96db",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-packs-subdoc-summary llama-index-llms-openai llama-index-embeddings-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d619362b-ae45-4e47-b400-1c2ce7262496",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.packs.subdoc_summary import SubDocSummaryPack\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "subdoc_summary_pack = SubDocSummaryPack(\n",
    "    documents,\n",
    "    parent_chunk_size=8192,  # default,\n",
    "    child_chunk_size=512,  # default\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\"),\n",
    "    embed_model=OpenAIEmbedding(),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb11a60d-d356-40c5-84c1-4135382bfbfd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Llama 2 was pretrained using an optimized auto-regressive transformer with robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention to improve inference scalability for larger models."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 172a1344-d48d-443b-8383-677037570c06<br>**Similarity:** 0.8720929924174893<br>**Text:** page_label: 1\n",
       "file_name: llama2.pdf\n",
       "file_path: data/llama2.pdf\n",
       "file_type: application/pdf\n",
       "file_size: 13661300\n",
       "creation_date: 2024-02-17\n",
       "last_modified_date: 2024-02-17\n",
       "last_accessed_date: 2024-02-17\n",
       "context_summary: Llama 2 is a collection of pretrained and fine-tuned large language models optimized for dialogue use cases, ranging from 7 billion to 70 billion parameters. The models, known as Llama 2-Chat, have shown superior performance compared to open-source chat models on various benchmarks and are considered as potential alternatives to closed-source models.\n",
       "\n",
       "Llama 2 : Open Foundation and Fine-Tuned Chat Models\n",
       "Hugo Touvron∗Louis Martin†Kevin Stone†\n",
       "Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\n",
       "Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\n",
       "Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\n",
       "Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\n",
       "Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\n",
       "Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\n",
       "Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\n",
       "Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\n",
       "Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\n",
       "Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\n",
       "Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\n",
       "Sergey Edunov Thomas Scialom∗\n",
       "GenAI, Meta\n",
       "Abstract\n",
       "In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\n",
       "large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\n",
       "Our fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our\n",
       "models outperform open-source chat models on most benchmarks we tested, and based on\n",
       "ourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-\n",
       "source models. We provide a detailed description of our approach to fine-tuning and safety\n",
       "improvements of Llama 2-Chat in order to enable the community to build on our work and\n",
       "contribute to the responsible development of LLMs.<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** dbbde2a7-d51c-4245-959d-ba97ba414b55<br>**Similarity:** 0.8700958215249326<br>**Text:** page_label: 5\n",
       "file_name: llama2.pdf\n",
       "file_path: data/llama2.pdf\n",
       "file_type: application/pdf\n",
       "file_size: 13661300\n",
       "creation_date: 2024-02-17\n",
       "last_modified_date: 2024-02-17\n",
       "last_accessed_date: 2024-02-17\n",
       "context_summary: Llama 2-Chat is developed through pretraining, supervised fine-tuning, and reinforcement learning with human feedback methodologies, focusing on refining the model iteratively. The training process involves using an optimized auto-regressive transformer, robust data cleaning, updated data mixes, and specific architectural enhancements like increased context length and grouped-query attention.\n",
       "\n",
       "Figure4: Trainingof Llama 2-Chat : Thisprocessbeginswiththe pretraining ofLlama 2 usingpublicly\n",
       "availableonlinesources. Followingthis,wecreateaninitialversionof Llama 2-Chat throughtheapplication\n",
       "ofsupervised fine-tuning . Subsequently, the model is iteratively refined using Reinforcement Learning\n",
       "with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy\n",
       "Optimization(PPO).ThroughouttheRLHFstage,theaccumulationof iterativerewardmodelingdata in\n",
       "parallel with model enhancements is crucial to ensure the reward models remain within distribution.\n",
       "2 Pretraining\n",
       "Tocreatethenewfamilyof Llama 2models,webeganwiththepretrainingapproachdescribedinTouvronetal.\n",
       "(2023), using an optimized auto-regressive transformer, but made several changes to improve performance.\n",
       "Specifically,weperformedmorerobustdatacleaning,updatedourdatamixes,trainedon40%moretotal\n",
       "tokens,doubledthecontextlength,andusedgrouped-queryattention(GQA)toimproveinferencescalability\n",
       "for our larger models. Table 1 compares the attributes of the new Llama 2 models with the Llama 1 models.\n",
       "2.1 Pretraining Data\n",
       "Our training corpus includes a new mix of data from publicly available sources, which does not include data\n",
       "fromMeta’sproductsorservices. Wemadeanefforttoremovedatafromcertainsitesknowntocontaina\n",
       "highvolumeofpersonalinformationaboutprivateindividuals. Wetrainedon2trilliontokensofdataasthis\n",
       "providesagoodperformance–costtrade-off,up-samplingthemostfactualsourcesinanefforttoincrease\n",
       "knowledge and dampen hallucinations.\n",
       "Weperformedavarietyofpretrainingdatainvestigationssothatuserscanbetterunderstandthepotential\n",
       "capabilities and limitations of our models; results can be found in Section 4.1.\n",
       "2.2 Training Details\n",
       "We adopt most of the pretraining setting and model architecture from Llama 1 .<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Markdown, display\n",
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "response = subdoc_summary_pack.run(\"How was Llama2 pretrained?\")\n",
    "display(Markdown(str(response)))\n",
    "for n in response.source_nodes:\n",
    "    display_source_node(n, source_length=10000, metadata_mode=\"all\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1181af9d-680f-4ba3-89e2-f88b12a89cc7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The latest ChatGPT model, equipped with Ghost Attention (GAtt), demonstrates strong multi-turn memory ability by consistently referring to defined attributes for up to 20 turns in a conversation. This integration of GAtt in the ChatGPT model allows for efficient long context attention beyond 2048 tokens, showcasing potential for robust performance in handling extended contexts."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 005a3c23-8d97-4e5d-957e-98ad2dfb93ad<br>**Similarity:** 0.7923889627946064<br>**Text:** page_label: 54\n",
       "file_name: llama2.pdf\n",
       "file_path: data/llama2.pdf\n",
       "file_type: application/pdf\n",
       "file_size: 13661300\n",
       "creation_date: 2024-02-17\n",
       "last_modified_date: 2024-02-17\n",
       "last_accessed_date: 2024-02-17\n",
       "context_summary: Llama 2-Chat with GAtt consistently refers to defined attributes for up to 20 turns, showcasing strong multi-turn memory ability. The integration of GAtt in Llama 2-Chat enables efficient long context attention beyond 2048 tokens, indicating potential for robust performance in handling extended contexts.\n",
       "\n",
       "Dialogue Turn Baseline + GAtt\n",
       "2 100% 100%\n",
       "4 10% 100%\n",
       "6 0% 100%\n",
       "20 0% 100%\n",
       "Table30: GAttresults. Llama 2-Chat withGAttisabletorefertoattributes100%ofthetime,forupto20\n",
       "turns from our human evaluation. We limited the evaluated attributes to public figures and hobbies.\n",
       "Theattentionnowspansbeyond20turns. Wetestedthemodelabilitytorememberthesystemarguments\n",
       "troughahumanevaluation. Thearguments(e.g. hobbies,persona)aredefinedduringthefirstmessage,and\n",
       "then from turn 2 to 20. We explicitly asked the model to refer to them (e.g. “What is your favorite hobby?”,\n",
       "“Whatisyourname?”),tomeasurethemulti-turnmemoryabilityof Llama 2-Chat . Wereporttheresults\n",
       "inTable30. EquippedwithGAtt, Llama 2-Chat maintains100%accuracy,alwaysreferringtothedefined\n",
       "attribute,andso,upto20turns(wedidnotextendthehumanevaluationmore,andalltheexampleshad\n",
       "lessthan4048tokensintotalovertheturns). Asacomparison, Llama 2-Chat withoutGAttcannotanymore\n",
       "refer to the attributes after only few turns: from 100% at turn t+1, to 10% at turn t+3 and then 0%.\n",
       "GAttZero-shotGeneralisation. Wetriedatinferencetimetosetconstrainnotpresentinthetrainingof\n",
       "GAtt. For instance, “answer in one sentence only”, for which the model remained consistent, as illustrated in\n",
       "Figure 28.\n",
       "We applied first GAtt to Llama 1 , which was pretrained with a context length of 2048 tokens and then\n",
       "fine-tuned with 4096 max length. We tested if GAtt works beyond 2048 tokens, and the model arguably\n",
       "managed to understand attributes beyond this window. This promising result indicates that GAtt could be\n",
       "adapted as an efficient technique for long context attention.\n",
       "A.3.6 How Far Can Model-Based Evaluation Go?<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 0b1719e9-d7fa-42af-890b-5eeb946857c5<br>**Similarity:** 0.7837282816384877<br>**Text:** page_label: 16\n",
       "file_name: llama2.pdf\n",
       "file_path: data/llama2.pdf\n",
       "file_type: application/pdf\n",
       "file_size: 13661300\n",
       "creation_date: 2024-02-17\n",
       "last_modified_date: 2024-02-17\n",
       "last_accessed_date: 2024-02-17\n",
       "context_summary: The text discusses the challenges faced in maintaining multi-turn consistency in dialogue systems and introduces a method called Ghost Attention (GAtt) to address these issues. GAtt involves incorporating instructions throughout a conversation to ensure dialogue control over multiple turns.\n",
       "\n",
       "Figure 9: Issues with multi-turn memory (left)can be improved with GAtt (right).\n",
       "We train for between 200and400iterations for all our models, and use evaluations on held-out prompts for\n",
       "earlystopping. EachiterationofPPOonthe70Bmodeltakesonaverage ≈330seconds. Totrainquicklywith\n",
       "large batch sizes, we use FSDP (Zhao et al., 2023). This was effective when using O(1) forward or backward\n",
       "passes,butcausedalargeslowdown( ≈20×)duringgeneration,evenwhenusingalargebatchsizeandKV\n",
       "cache. We were able to mitigate this by consolidating the model weights to each node once before generation\n",
       "and then freeing the memory after generation, resuming the rest of the training loop.\n",
       "3.3 System Message for Multi-Turn Consistency\n",
       "In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly,\n",
       "or to“act as”some public figure. When we provided such instructions to Llama 2-Chat , the subsequent\n",
       "response should always respect the constraint. However, our initial RLHF models tended to forget the initial\n",
       "instruction after a few turns of dialogue, as illustrated in Figure 9 (left).\n",
       "To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context\n",
       "Distillation (Bai et al., 2022b) that hacks the fine-tuning data to help the attention focus in a multi-stage\n",
       "process. GAtt enables dialogue control over multiple turns, as illustrated in Figure 9 (right).\n",
       "GAttMethod. Assumewe haveaccess toa multi-turndialoguedataset betweentwo persons(e.g., auser\n",
       "and an assistant), with a list of messages [u1, a1, . . . , u n, an], where unandancorrespond to the user and\n",
       "assistant messages for turn n, respectively. Then, we define an instruction, inst, that should be respected\n",
       "throughout the dialogue. For example, instcould be “act as.” We can then synthetically concatenate this\n",
       "instruction to all the user messages of the conversation.\n",
       "Next, we can sample from this synthetic data using the latest RLHF model.<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Markdown, display\n",
    "\n",
    "response = subdoc_summary_pack.run(\n",
    "    \"What is the functionality of latest ChatGPT memory.\"\n",
    ")\n",
    "display(Markdown(str(response)))\n",
    "\n",
    "for n in response.source_nodes:\n",
    "    display_source_node(n, source_length=10000, metadata_mode=\"all\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v3",
   "language": "python",
   "name": "llama_index_v3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
